
Improve transformer_squad reader to avoid duplicate tokenization of context in training #263

Merged
merged 2 commits into allenai:main from MagiaSN:transformer_squad_dev on May 17, 2021

Conversation

@MagiaSN (Contributor) commented on May 15, 2021

In training, the same context is used in multiple instances, so this pull request adds cached_tokenized_context to avoid duplicate tokenization of the context. This reduces the preprocessing time from 30m49s to 13m35s on my machine, and yields exactly the same dev results as the original implementation.
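
The idea is to reuse the tokenized context across consecutive instances that share the same paragraph, so the tokenizer runs once per paragraph instead of once per question (SQuAD groups questions under their paragraph, so shared contexts arrive back to back). Below is a minimal sketch of that caching pattern, assuming a tokenizer object with a `tokenize` method; the class and attribute names are illustrative, not the exact identifiers used in this PR:

```python
from typing import List, Optional


class TransformerSquadReaderSketch:
    """Illustrative reader fragment showing the context-tokenization cache."""

    def __init__(self, tokenizer) -> None:
        self._tokenizer = tokenizer
        # Cache the most recently tokenized context. A single entry is enough
        # because instances that share a context are read consecutively.
        self._cached_context: Optional[str] = None
        self._cached_context_tokens: Optional[List[str]] = None

    def _tokenize_context(self, context: str) -> List[str]:
        # Only re-tokenize when the context string actually changes.
        if context != self._cached_context:
            self._cached_context = context
            self._cached_context_tokens = self._tokenizer.tokenize(context)
        return self._cached_context_tokens
```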

@epwalsh (Member) left a comment


This is a great improvement! Thanks @MagiaSN 🙂

@epwalsh epwalsh merged commit dea182c into allenai:main May 17, 2021
@MagiaSN MagiaSN deleted the transformer_squad_dev branch May 18, 2021 04:18